
exclude fsdp from delay_optimizer_creation #34140

Merged: 34 commits merged into huggingface:main on Oct 28, 2024

Conversation

eljandoubi (Contributor)

What does this PR do?

It passes the model and the optimizer to accelerate.prepare together, so that Accelerate can enable fp8 mixed precision when it is configured.

Fixes #34024
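
Concretely, the Trainer previously delayed optimizer creation whenever FSDP was enabled, so only the model reached accelerator.prepare. With plain FSDP excluded from that condition, the optimizer exists early enough to be prepared alongside the model. The standalone Accelerate snippet below is only a sketch of that mechanism, not the Trainer internals touched by this PR:

```python
# Minimal sketch: Accelerate can only set up fp8 (TransformerEngine) when the
# model and the optimizer are passed to prepare() together. This is plain
# Accelerate, not the Trainer code changed in this PR.
import torch
from accelerate import Accelerator

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# With ACCELERATE_MIXED_PRECISION="fp8" in the environment (and
# TransformerEngine installed), prepare() wraps both objects for fp8 training.
accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(4, 16, device=accelerator.device)
loss = model(x).sum()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
```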

Who can review?


@muellerzr (Contributor) left a comment

Nice :) Can we add a test in tests/test_trainer.py? We can set env variables to configure Accelerate properly (ACCELERATE_MIXED_PRECISION="fp8" will auto-use TE)
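
For reference, a rough sketch of such a test (not the test that was ultimately added; the tiny model, dataset, and test name are placeholders), using the env var so Accelerate picks up fp8 when the Trainer prepares the model and optimizer:

```python
# Sketch of an fp8 Trainer test for tests/trainer/test_trainer.py.
# Assumptions: a GPU with fp8 support and TransformerEngine installed;
# DummyDataset/DummyModel are illustrative, not existing test fixtures.
import os
from unittest import mock

import torch
from torch.utils.data import Dataset

from transformers import Trainer, TrainingArguments


class DummyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        x = torch.randn(4)
        return {"input_ids": x, "labels": x.sum().unsqueeze(0)}


class DummyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 1)

    def forward(self, input_ids=None, labels=None):
        preds = self.linear(input_ids)
        return {"loss": torch.nn.functional.mse_loss(preds, labels), "logits": preds}


def test_trainer_fp8_mixed_precision(tmp_path):
    # ACCELERATE_MIXED_PRECISION="fp8" makes Accelerate configure fp8
    # (TransformerEngine) when the Trainer calls accelerator.prepare.
    with mock.patch.dict(os.environ, {"ACCELERATE_MIXED_PRECISION": "fp8"}):
        args = TrainingArguments(
            output_dir=str(tmp_path),
            per_device_train_batch_size=2,
            max_steps=2,
            report_to=[],
        )
        Trainer(model=DummyModel(), args=args, train_dataset=DummyDataset()).train()
```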

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eljandoubi (Contributor, Author) commented on Oct 14, 2024

The required tests are distributed tests. We need to verify FSDP functionality with and without FP8 mixed precision. The appropriate test file might be tests/trainer/test_trainer_fsdp.py.
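
As a rough illustration (the helper name and FSDP options here are assumptions, not the final test code), the distributed test would run the same FSDP configuration twice, once without and once with the fp8 env var, under torchrun on multiple GPUs:

```python
# Sketch of the TrainingArguments such an FSDP test would exercise.
# The exact FSDP options and how fp8 is toggled are assumptions.
import os

from transformers import TrainingArguments


def fsdp_training_args(output_dir: str, use_fp8: bool) -> TrainingArguments:
    if use_fp8:
        # fp8 is selected through Accelerate's environment/config,
        # not through a dedicated TrainingArguments flag.
        os.environ["ACCELERATE_MIXED_PRECISION"] = "fp8"
    return TrainingArguments(
        output_dir=output_dir,
        fsdp="full_shard auto_wrap",  # enable PyTorch FSDP in the Trainer
        per_device_train_batch_size=1,
        max_steps=2,
    )
```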

@eljandoubi (Contributor, Author)

Is the tests/trainer folder included in the CI tests? Where can I check the results for test_trainer_fsdp.py? @muellerzr @SunMarc

@muellerzr (Contributor) commented on Oct 15, 2024

@eljandoubi we can't run them on the normal CI, since GPU runners are not part of PRs.

Instead, when it's ready I'll pull the PR down and run it myself.

@eljandoubi (Contributor, Author)

@muellerzr Thank you for the information. I have tested the branch in my code on a multi-node, multi-GPU setup using FSDP mode, both with and without FP8 mixed precision, and it worked as expected. Please let me know if you encounter any issues on your end.

eljandoubi referenced this pull request Oct 16, 2024
* Fix FSDP Initialization for resume training

* Added init_fsdp function to work with dummy values

* Fix FSDP initialization for resuming training

* Added CUDA decorator for tests

* Added torch_gpu decorator to FSDP tests

* Fixup for failing code quality tests

@eljandoubi (Contributor, Author)

@muellerzr Any updates regarding this PR?

@muellerzr (Contributor) left a comment

Thanks! Can you do pip install -e .[quality] followed by make fixup? I'll then pull it locally to test on my 4090 system and we should be set!

@eljandoubi (Contributor, Author) commented on Oct 25, 2024

@muellerzr I have done make fixup.

@ArthurZucker (Collaborator) left a comment

Don't worry, we'll merge as is; the failing tests are unrelated!

@ArthurZucker merged commit 8b3b9b4 into huggingface:main on Oct 28, 2024
18 of 22 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* exclude fsdp from delay_optimizer_creation

* add test case for trainer: FSDP mode and fp8 as mixed precision

* rearrange imports

* ruff formatted

* adapt _init_fsdp to fp8

* use _init_fsdp only when resume_from_checkpoint

* In case of FSDP, self.layer will be a CheckpointWrapper, which has no len() method

* delete _init_fsdp

* solve conflict

* fix conflict

* make fixup
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
@muellerzr mentioned this pull request on Dec 11, 2024